Abstract. In the last decade we have seen an enormous increase in the size and quality of
spectroscopic galaxy surveys, both at low and high redshift. New statistical techniques to analyse large
portions of galaxy spectra are now finding favour over traditional index-based methods. Here we
review a new robust and iterative Principal Component Analysis (PCA) algorithm, which solves
several common issues with classic PCA. Application to the 4000Å break region of galaxies in the
VIMOS VLT Deep Survey (VVDS) and Sloan Digital Sky Survey (SDSS) gives new high signal-to-noise
ratio spectral indices that are easily interpretable in terms of recent star formation history. In particular,
we identify a sample of post-starburst galaxies at z ~ 0.7 and z ~ 0.07. We quantify for the first time
the importance of post-starburst galaxies, consistent with being descendants of gas-rich major mergers,
for building the red sequence. Finally, we present a comparison with new low and high redshift
"mock spectroscopic surveys" derived from a Millennium Run semi-analytic model.
1. MOTIVATION

The rest-frame optical spectrum of a galaxy contains a wealth of information about its 
past and present star formation rate, chemical evolution, dust content and the presence 
of an active galactic nucleus. The advent of the Sloan Digital Sky Survey (SDSS) led 
us into a new era in low redshift spectroscopic galaxy surveys, with the number of 
unique, well calibrated, high quality galaxy spectra covering the full optical wavelength
extent approaching $10^{6}$. At high redshift, progress in the last decade has been similarly
significant. Both the VIMOS VLT Deep Survey (VVDS) and DEEP2 surveys, which have
released a large number of galaxy spectra to the public, allow the same measurement of 
galaxy physical parameters at z ~ 1 as routinely carried out at z ~ 0. 

At low redshift, the days of discrete classification of objects into red/blue,
elliptical/spiral etc. are over. The bimodality of the galaxy population is well quantified in
physical parameters such as stellar mass and star formation rates. We are now in an era 
of "galaxy population" studies, in which each formerly distinct class is thought of as 
part of a wider community. Rare classes of objects, such as "starburst", "green valley"
and "post-starburst" galaxies can be placed into a global picture of star formation
patterns. So-called "transition" galaxies are attracting great interest, due to their importance
for understanding the physical mechanisms responsible for the global shut down in star
formation, and the build-up of the red sequence [1,2].

FIGURE 1. Example main sequence stellar spectra in the wavelength region of the 4000Å break. The
stars are ordered by temperature, from hot to cold (i.e. shortest to longest main sequence lifetime). The
4000Å break strength (grey shaded areas) increases monotonically with decreasing temperature, while Hδ
equivalent width (hatched areas) decreases for the cooler stars; both therefore provide powerful age indicators
for young to intermediate age stellar populations.

In these proceedings we review recent work on applying principal component
analysis (PCA) to the 4000Å break region of galaxy spectra. This region provides constraints
on the recent star formation history of galaxies, important for identifying transition
galaxies. The 4000Å break region is accessible to optical spectroscopic surveys over
0 < z < 1, making observations directly comparable over half of the age of the
Universe [3], a time during which the star formation habits of the galaxy population change
considerably.

2. STAR FORMATION HISTORIES FROM GALAXY SPECTRA 

In the optical regime, the 4000Å break region of the spectra contains the greatest amount
of information on the recent star formation history of galaxies. Figure 1 illustrates this
point. With decreasing stellar temperature the 4000Å break strength increases, and the
Balmer absorption lines first strengthen and then weaken. A strong UV continuum is 
evident in the hottest stars. Each stellar type has a characteristic main sequence lifetime, 
~0.5 Gyr and 3 Gyr for A and F stars respectively. It is these different lifetimes that 
allow us to measure the star formation histories of stellar populations. 

FIGURE 2. The mean spectrum and top four eigenspectra for the VVDS galaxies. Left: The result from
classic PCA on 3485 spectra. Center: The result from classic PCA with iterative removal of outliers. The
final dataset contains 2675 spectra. Right: The result from the new iterative-robust PCA algorithm.

PCA is an unsupervised multivariate statistical method traditionally used to identify
correlations in datasets. It has been applied to astronomical spectral datasets for a variety
of purposes [e.g. 4, 5, 6, 7, 8]. PCA identifies correlated features, such as the Balmer
absorption lines, extracting them easily as a single parameter. When applied to the
4000Å break region of galaxy spectra, PCA recovers the strength of the 4000Å break
as the main axis of variation, which constrains the mean age of the stellar population,
or equivalently the specific star formation rate [9]. The second axis of variation is the
Balmer absorption line strength, which constrains the fraction of intermediate age stars
[10]. PCA also identifies Ca II (H&K) as a clearly interpretable third axis.
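
As an illustration of the decomposition described above, the following minimal Python sketch applies classic PCA to a toy set of synthetic spectra through a singular value decomposition. The wavelength grid, the step-like "break", the Gaussian "absorption line" and the function name classic_pca are all assumptions made for this example; they are not the data or the code used in the works cited here.

```python
import numpy as np

def classic_pca(spectra, n_components=3):
    """Classic PCA of an (n_galaxies, n_pixels) array via SVD.

    Returns the mean spectrum, the top eigenspectra (rows), the amplitudes of
    each galaxy on those eigenspectra, and the fraction of variance explained.
    """
    mean_spectrum = spectra.mean(axis=0)
    residuals = spectra - mean_spectrum
    # Rows of vt are orthonormal eigenspectra, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(residuals, full_matrices=False)
    eigenspectra = vt[:n_components]
    amplitudes = residuals @ eigenspectra.T
    variance_fraction = singular_values**2 / np.sum(singular_values**2)
    return mean_spectrum, eigenspectra, amplitudes, variance_fraction[:n_components]

# Toy data (assumed, for illustration only): 200 spectra of 50 pixels,
# built from two hidden features plus a little noise.
rng = np.random.default_rng(0)
wave = np.linspace(3750.0, 4150.0, 50)
step = (wave > 4000.0).astype(float)                 # crude stand-in for a break
line = np.exp(-0.5 * ((wave - 3889.0) / 8.0) ** 2)   # crude absorption line
strengths = rng.normal(size=(200, 2)) * np.array([1.0, 0.3])
spectra = 1.0 + strengths[:, :1] * step - strengths[:, 1:] * line
spectra += rng.normal(scale=0.01, size=spectra.shape)

mean_spectrum, eigenspectra, amplitudes, var_frac = classic_pca(spectra)
```

In this toy case the first component tracks the strength of the step, because that feature carries most of the variance; the ordering of components in real spectra depends on the data.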

In Wild et al. [11] we developed new PCA-based spectroscopic indices working in
the 4000Å break region, for the purpose of recovering the recent star formation history
of galaxies. We showed that, by taking advantage of the entire Balmer series and
continuum shape, a dramatic improvement could be achieved over the traditionally used Hδ
equivalent width. For this first application we chose to create the PCA basis set using
model galaxies from the Bruzual and Charlot [12] spectral synthesis models. On the one
hand, an oft-quoted benefit of PCA is that it can be applied directly to the data, allowing
the data to "speak for themselves". On the other hand, direct application of classic PCA
to modern galaxy data is fraught with challenges caused by misclassified spectra,
contamination from night sky lines, regions of missing data and enormous computational
memory requirements. In the following section we discuss a new robust and recursive
PCA algorithm developed in Budavári et al. [13], designed specifically with large
spectroscopic surveys in mind.

3. ROBUST AND ITERATIVE PCA 

In classical PCA, a set of eigenvectors is calculated through a matrix decomposition of
a single dataset. For very large samples, such as the SDSS galaxy catalog, information
is highly redundant in the statistical sense, i.e. the analysis of a smaller
subset often yields equally good results. Additionally, in most cases we seek only a small number
of eigenvectors associated with the largest eigenvalues; in classic PCA the remaining
vectors are computed in vain.

The first step of our new method is to formulate the problem within the framework 
of a data stream, rather than the traditional data set. The eigensystem is recursively 
updated as new data are input, with only the top N eigenvectors calculated as required.
Convergence is controlled by a single parameter that sets the effective sample size and 
can easily be tuned to match the dataset being analysed. 
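
The data-stream formulation can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation of [13]: the covariance is approximated by a low-rank eigensystem that is down-weighted by a factor gamma = 1 - 1/effective_size at each step and then updated with the new residual. The function name streaming_pca and its parameters are hypothetical.

```python
import numpy as np

def streaming_pca(stream, n_components, effective_size):
    """Recursively update the top eigenvectors as each spectrum arrives.

    Assumed update rule: C_new ~ gamma * E L E^T + (1 - gamma) * y y^T,
    solved through the SVD of a thin (n_pixels, n_components + 1) matrix.
    """
    stream = np.asarray(stream, dtype=float)
    gamma = 1.0 - 1.0 / effective_size
    n_init = 5 * n_components                      # small batch to start from
    init = stream[:n_init]
    mean = init.mean(axis=0)
    _, s, vt = np.linalg.svd(init - mean, full_matrices=False)
    basis = vt[:n_components].T                    # columns are eigenvectors
    eigvals = s[:n_components] ** 2 / n_init

    for x in stream[n_init:]:
        mean = gamma * mean + (1.0 - gamma) * x    # running mean
        y = x - mean
        # Columns: down-weighted old eigensystem, plus the new residual.
        a = np.column_stack([basis * np.sqrt(gamma * eigvals),
                             np.sqrt(1.0 - gamma) * y])
        u, s, _ = np.linalg.svd(a, full_matrices=False)
        basis = u[:, :n_components]                # keep only the top vectors
        eigvals = s[:n_components] ** 2
    return mean, basis, eigvals

# Toy stream (assumed): 20 "pixels", two directions with large variance.
rng = np.random.default_rng(1)
scales = np.full(20, 0.1)
scales[0], scales[1] = 3.0, 1.5
stream = rng.normal(size=(2000, 20)) * scales
mean, basis, eigvals = streaming_pca(stream, n_components=2, effective_size=500)
```

Only an (n_pixels, n_components + 1) matrix is decomposed at each step, so the full covariance matrix is never stored. A larger effective_size gives a longer memory and smoother convergence.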

The second step is to make the algorithm robust to outliers. Classical PCA simply 
minimises the square of the residual between the eigensystem and input dataset, a 
statistical procedure inherently susceptible to outliers. With small datasets it is possible 
to remove the few "obvious" outliers by hand or within a few quick iterations of classic 
PCA, a step which is both subjective and impractical for modern large datasets. In
the last few years, a number of improvements have been proposed to overcome the 
issue of robustness in PCA, within the framework of robust statistics [14]. Rather than 
minimising the square of the residuals, a robust function is introduced to control the 
level of contamination tolerated from outliers. 
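
The contrast between the two criteria can be written schematically as follows. The notation (residual r_i, scale σ, function ρ, weights w_i) is ours, introduced for illustration, and the Cauchy-type ρ is only one common choice from the robust statistics literature; the specific function adopted in [13] is not stated here.

```latex
% Classic PCA: least squares over the residuals r_i of each centred spectrum y_i
% after projection onto the eigenbasis E (orthonormal columns)
\min_{E} \sum_i r_i^2 , \qquad r_i^2 = \left\| y_i - E E^{T} y_i \right\|^2

% Robust PCA: replace the square by a more slowly growing function \rho
\min_{E} \sum_i \rho\!\left( \frac{r_i^2}{\sigma^2} \right)

% Stationary points are those of a weighted PCA, with E the leading
% eigenvectors of C and weights w_i = \rho'(r_i^2 / \sigma^2)
C = \frac{\sum_i w_i \, y_i y_i^{T}}{\sum_i w_i}

% Example (assumed): Cauchy-type function, so outliers get weight ~ 1/t
\rho(t) = \ln\!\left( 1 + \frac{t}{c^2} \right) , \qquad
w(t) = \rho'(t) = \frac{1}{c^2 + t}
```

Classic PCA corresponds to ρ(t) = t, for which every spectrum has unit weight regardless of how poorly it is described by the eigenbasis.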

Figure 2 presents a comparison between eigenspectra created from classic PCA (left), 
those created using a traditional workaround in which data are iteratively trimmed 
(center), and the new iterative and robust method (right). The dataset is a sample of 
3485 optical spectra from the VIMOS VLT Deep Survey (VVDS) [15], which will be
introduced in the following section. A successful eigenbasis can be described as one 
that does not introduce noise into the decomposition of individual galaxy spectra, and 
in which the top few eigenspectra describe the variance in the majority of the dataset. 
The classic PCA fails both of these criteria: the eigenspectra are noisy, and the first two 
component amplitudes show significant correlation for good quality spectra. 

In this test case, the new eigenspectra are similar to those from the trimmed PCA 
but small improvements are apparent. It is worth noting that the PCA algorithm is 
completely independent of the order of the bins: it has no spatial coherence. Therefore, 
the fact that the new eigenspectra are smoother is already an indication that they are 
more robust. The following section presents the first application of indices derived using 
this method to a scientific problem. 

4. POST-STARBURST GALAXIES AT z ~ 0.7 

The VIMOS VLT Deep Survey (VVDS) is a deep spectroscopic redshift survey, targeting
objects with apparent magnitudes in the range $17.5 < I_{AB} < 24$ [15]. The survey
is unique among high redshift galaxy surveys in having applied no colour cuts, yielding a
particularly simple selection function and making it a very attractive dataset for statistical
studies of the high redshift galaxy population. In this work we make use of the spectra
from the publicly available first epoch data release of the VVDS-0226-04 (VVDS-02h)
field. The spectra have a relatively low resolution of R = 227 and an observed frame
wavelength range of ~5500-9500Å. The first epoch public data release contains 8981
spectroscopically observed objects in the VVDS-02h field, from which we select 1246 with
secure redshifts in the range 0.5 < z < 1.0 and a per-pixel signal-to-noise ratio (SNR)
greater than 6.5.

FIGURE 3. The first two principal components (PCs) for different samples of ~1250 galaxies. The
samples have been split into quiescent (red), star-forming (cyan), star-bursting (blue) and post-starburst
(orange) classes. Post-starburst galaxies are defined to lie within the dotted box indicated by at least
$1\sigma$. Top left: VVDS galaxies with 0.5 < z < 1.0. Post-starburst galaxies with SSFR $< 10^{-11}\,\mathrm{yr}^{-1}$ are
circled. The median errors of the whole sample (black) and the post-starburst galaxies alone (orange)
are shown in the top left. Top right: a comparison SDSS low redshift sample with 0.05 < z < 0.1 and
$\log(M/M_\odot) > 9.75$. The same number of galaxies as in the VVDS sample have been randomly selected
for illustration purposes. The lower panels depict work in progress to recreate both VVDS and SDSS
spectroscopic surveys from the Millennium Run semi-analytic model of De Lucia and Blaizot [16].

The new robust and iterative PCA method was applied to these spectra to create the
eigenspectra directly from the VVDS spectra. In the top left panel of Figure 3 we show
the distribution of the first two principal component amplitudes, PC1 and PC2, for this
sample of VVDS galaxies. The primary division of our sample is into "quiescent" and
"star-forming" galaxies on the right and left. To the bottom left, "starburst" galaxies
are found with very blue continua and very strong emission lines. To the top centre we
find the "post-starburst" galaxies, with stronger Balmer absorption than expected for
their 4000Å break strength. In the top right panel we show a comparison sample derived
from the SDSS DR6 catalog: 1246 galaxies with 0.05 < z < 0.1 have been selected
randomly for illustration purposes.¹ There are some clear differences between the high
and low redshift samples. Firstly, the entire galaxy population noticeably ages (increases
in PC1) as redshift decreases, the red sequence builds and the strong starburst galaxies
are no longer visible in the SDSS survey. Secondly, the scatter in both the blue and red
sequence drops substantially with decreasing redshift, even when the higher SNR of the
SDSS spectra is accounted for. The SDSS shows a clear blue sequence, whereas the
VVDS shows a "blue cloud".

¹ We note that these samples are not mass limited, and therefore the completeness increases for a given
stellar mass from right to left across the diagrams. However, the completeness limits of the two samples
are similar, and the left and right panels may be directly compared.

In Wild et al. [17] we compare with SPH simulations of galaxy mergers [18] to show
that the strong post-starburst (PSB) galaxies found in the VVDS and SDSS surveys are
consistent with being the descendants of gas-rich major mergers. The inclusion of black
hole feedback does not greatly alter the evolution of the simulated merger remnants
through the post-starburst phase. Starburst mass fractions must be larger than ~5-10%
and decay times shorter than ~$10^{8}$ years for post-starburst spectral signatures to be
observed in the simulations.

In the VVDS survey, we find 16 PSB galaxies above our mass completeness limit of
$\log(M/M_\odot) > 9.75$. These correspond to a number density of $1 \times 10^{-4}\,\mathrm{Mpc}^{-3}$. Summing
the mass of the 5 PSB galaxies which have completely ceased the formation of stars,
and are therefore most likely to enter the red sequence, we measure a mass flux from
blue to red sequence through the post-starburst phase of
$\rho_{A \to Q,\mathrm{PSB}} = 0.0038\ M_\odot\,\mathrm{Mpc}^{-3}\,\mathrm{yr}^{-1}$.
Comparison with Arnouts et al. [19] shows that this accounts for 38%
of the growth rate of the red sequence at z ~ 0.7.

We find a very strong redshift evolution of these strong post-starburst galaxies: the 
number density is 200 times lower at z ~ 0.07 than at z ~ 0.7. In the redshift range 
0.05 < z < 0.1, only 3 equivalently strong post-starburst galaxies are found in the SDSS
DR6 catalogue above our mass completeness limit of $\log(M/M_\odot) > 9.75$. The strength
of this evolution suggests that a combination of effects is responsible: declining merger
rates [20], declining gas fractions and increasing disc dynamical timescales, leading to 
increasing burst durations and weakening burst strengths. 



4.1. Comparison with cosmological simulations 

In the lower panels of Figure 3 we present recent results from work in progress to
extract mock spectroscopic catalogues from the Millennium Run [21] semi-analytic model
(SAM) of De Lucia and Blaizot [16]. Mock magnitude-limited catalogues are created
using the MoMaF software [22], and the star formation and metallicity histories of the
SAM galaxies are combined with the Bruzual and Charlot [12] spectral synthesis
models to create integrated spectral energy distributions for each SAM galaxy. Prescriptions
for nebular emission lines and two component dust attenuation are included [23]. The 
resulting spectra can then be analysed in an entirely equivalent manner to the real data. 
In the case of Figure 3, the SAM galaxy spectra have been projected onto the robust 
eigenspectra created from the VVDS galaxy spectra as described above. 
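
Projecting a spectrum, real or mock, onto a fixed set of eigenspectra amounts to a linear fit. The sketch below is a minimal illustration under our own assumptions: the function name project_spectrum and the boolean mask of good pixels are hypothetical, and the least-squares treatment of masked pixels is one simple way to handle regions of missing data, not necessarily the procedure used for Figure 3.

```python
import numpy as np

def project_spectrum(flux, mean_spectrum, eigenspectra, good=None):
    """Amplitudes of one spectrum on the eigenspectra (rows of `eigenspectra`).

    Only pixels flagged True in `good` enter the least-squares fit, so gaps
    (e.g. masked sky lines) do not bias the amplitudes towards zero.
    """
    flux = np.asarray(flux, dtype=float)
    if good is None:
        good = np.ones(flux.shape, dtype=bool)
    residual = (flux - mean_spectrum)[good]
    design = eigenspectra[:, good].T           # (n_good_pixels, n_components)
    amplitudes, *_ = np.linalg.lstsq(design, residual, rcond=None)
    return amplitudes

# Toy check (assumed data): an orthonormal basis of 3 eigenspectra, 30 pixels.
rng = np.random.default_rng(2)
q, _ = np.linalg.qr(rng.normal(size=(30, 3)))
eigenspectra = q.T
mean_spectrum = np.ones(30)
flux = mean_spectrum + 2.0 * eigenspectra[0] - 1.0 * eigenspectra[1]

good = np.ones(30, dtype=bool)
good[10:15] = False                            # pretend these pixels are missing

full = project_spectrum(flux, mean_spectrum, eigenspectra)
gappy = project_spectrum(flux, mean_spectrum, eigenspectra, good)
```

For complete, noise-free data and orthonormal eigenspectra the fit reduces to a simple dot product. With gaps, the eigenspectra are no longer orthogonal over the surviving pixels, which is why a fit is used instead.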

Comparing the upper (real) and lower (mock) panels of Figure 3 we can be encour- 
aged by the agreement between the SAM and real Universes. In particular, the overall 
ageing of the stellar populations is well reproduced, together with the decreasing scatter 
in the blue sequence as the Universe ages. Some shortcomings of the models are
evident, and are currently being investigated. The figures make a strong case that the
application of multivariate statistics to data compression and visualisation of large and 
complex datasets will lead to new insights into the physical processes responsible for 
driving the changes in star formation in galaxies over time. 

5. CONCLUSIONS 

In these proceedings we have gathered together recent and ongoing work by the authors
to quantify the recent star formation history of galaxies through use of the 4000Å break
region of galaxy spectra. Through a new robust PCA algorithm, we have derived high 
SNR spectroscopic indices from the high redshift VVDS galaxy survey. These indices 
combine the Balmer absorption line series and shape of the spectra to reveal the relative 
fractions of stellar types. We have shown how these indices can be applied to current 
scientific questions, by the identification of post-starburst galaxies. Finally, we have 
presented new work on extracting full spectral energy distributions for Millennium Run 
SAM galaxies, including prescriptions for dust attenuation and nebular emission. We 
have shown how the data compression and visualisation properties of PCA help us to
compare these modern large cosmological simulations directly with the modern, and
equally large, spectroscopic galaxy surveys.